!pip install pandas seaborn matplotlib phik scikit-learn
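import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import phik  # registers the DataFrame.phik_matrix accessor used below
from phik.report import plot_correlation_matrix
from sklearn import preprocessing  # provides LabelEncoder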
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
test
| | session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | ... | site6 | time6 | site7 | time7 | site8 | time8 | site9 | time9 | site10 | time10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29 | 2014-10-04 | 35 | 2014-10-04 | 22 | 2014-10-04 | 321 | 2014-10-04 | 23 | ... | 2211 | 2014-10-04 | 6730 | 2014-10-04 | 21 | 2014-10-04 | 44582 | 2014-10-04 | 15336 | 2014-10-04 |
| 1 | 2 | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | ... | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | 2014-07-03 | 782 | 2014-07-03 |
| 2 | 3 | 55 | 2014-12-05 | 55 | 2014-12-05 | 55 | 2014-12-05 | 55 | 2014-12-05 | 55 | ... | 55 | 2014-12-05 | 55 | 2014-12-05 | 55 | 2014-12-05 | 1445 | 2014-12-05 | 1445 | 2014-12-05 |
| 3 | 4 | 1023 | 2014-11-04 | 1022 | 2014-11-04 | 50 | 2014-11-04 | 222 | 2014-11-04 | 202 | ... | 3374 | 2014-11-04 | 50 | 2014-11-04 | 48 | 2014-11-04 | 48 | 2014-11-04 | 3374 | 2014-11-04 |
| 4 | 5 | 301 | 2014-05-16 | 301 | 2014-05-16 | 301 | 2014-05-16 | 66 | 2014-05-16 | 67 | ... | 69 | 2014-05-16 | 70 | 2014-05-16 | 68 | 2014-05-16 | 71 | 2014-05-16 | 167 | 2014-05-16 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 2014-10-02 | 1039 | 2014-10-02 | 676 | 2014-10-02 | 3 | 2014-05-27 | 167 | ... | 45064 | 2014-05-27 | 45065 | 2014-05-27 | 384 | 2014-05-27 | 23 | 2014-05-27 | 3346 | 2014-05-27 |
| 82793 | 82794 | 300 | 2014-05-26 | 302 | 2014-05-26 | 302 | 2014-05-26 | 300 | 2014-05-26 | 300 | ... | 1222 | 2014-05-26 | 302 | 2014-05-26 | 1218 | 2014-05-26 | 1221 | 2014-05-26 | 1216 | 2014-05-26 |
| 82794 | 82795 | 29 | 2014-05-02 | 33 | 2014-05-02 | 35 | 2014-05-02 | 22 | 2014-05-02 | 37 | ... | 6779 | 2014-05-02 | 30 | 2014-05-02 | 21 | 2014-05-02 | 23 | 2014-05-02 | 6780 | 2014-05-02 |
| 82795 | 82796 | 5828 | 2014-05-03 | 23 | 2014-05-03 | 21 | 2014-05-03 | 804 | 2014-05-03 | 21 | ... | 3350 | 2014-05-03 | 23 | 2014-05-03 | 894 | 2014-05-03 | 21 | 2014-05-03 | 961 | 2014-05-03 |
| 82796 | 82797 | 21 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | ... | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 |
82797 rows × 21 columns
train
full_df = pd.concat([train.drop('target', axis=1), test])
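A note on the concatenation above: `pd.concat` keeps the row labels of both inputs, so the labels of `test` repeat those of `train`. The sketch below illustrates this on two tiny made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the real data) and shows one possible variant that resets the index and remembers where the training rows end.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train and test (not the real data).
train_demo = pd.DataFrame({"session_id": [1, 2], "site1": [890, 14769], "target": [0, 1]})
test_demo = pd.DataFrame({"session_id": [1, 2, 3], "site1": [29, 782, 55]})

# Same call shape as in the notebook: row labels of both frames are kept, so 0 and 1 repeat.
stacked = pd.concat([train_demo.drop("target", axis=1), test_demo])

# Variant with a fresh 0..n-1 index; n_train marks the boundary between the two parts.
stacked_reset = pd.concat([train_demo.drop("target", axis=1), test_demo], ignore_index=True)
n_train = len(train_demo)

print(stacked.index.tolist())
print(stacked_reset.index.tolist())
```

With a unique index, `stacked_reset.iloc[:n_train]` and `stacked_reset.iloc[n_train:]` recover the training and test parts after shared preprocessing.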
2.1 Visual data analysis
# Define a function that takes a DataFrame, encodes its categorical features as numbers,
# and returns the updated DataFrame together with the encoders.
def number_encode_features(init_df):
    result = init_df.copy()  # copy the original table
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:  # the column holds strings, so it has to be encoded
            encoders[column] = preprocessing.LabelEncoder()  # create an encoder for this column
            result[column] = encoders[column].fit_transform(result[column])  # encode the column and overwrite it
    return result, encoders
encoded_full, encoders = number_encode_features(full_df)  # encoded_full now holds the encoded categorical features
encoded_full.head()
| | session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | ... | site6 | time6 | site7 | time7 | site8 | time8 | site9 | time9 | site10 | time10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 80 | 941 | 79 | 3847 | 79 | 941 | 78 | 942 | ... | 3846 | 78 | 3847 | 78 | 3846 | 78 | 1516 | 78 | 1518 | 78 |
| 1 | 3 | 14769 | 31 | 39 | 31 | 14768 | 31 | 14769 | 31 | 37 | ... | 39 | 31 | 14768 | 31 | 14768 | 31 | 14768 | 31 | 14768 | 31 |
| 2 | 4 | 782 | 106 | 782 | 105 | 782 | 104 | 782 | 103 | 782 | ... | 782 | 103 | 782 | 103 | 782 | 103 | 782 | 103 | 782 | 103 |
| 3 | 5 | 22 | 86 | 177 | 85 | 175 | 85 | 178 | 84 | 177 | ... | 178 | 84 | 175 | 84 | 177 | 84 | 177 | 84 | 178 | 84 |
| 4 | 6 | 570 | 96 | 21 | 95 | 570 | 94 | 21 | 93 | 21 | ... | 178 | 84 | 175 | 84 | 177 | 84 | 177 | 84 | 178 | 84 |
5 rows × 21 columns
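To make the encoded `time` values in the table above easier to read, here is a minimal sketch of what scikit-learn's `LabelEncoder` does to a column of date strings. The four dates are a hypothetical toy sample, not rows from the dataset. Each value is replaced by its position among the sorted unique values, so for ISO-formatted dates the codes follow chronological order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column of date strings (not taken from the dataset).
dates = pd.Series(["2014-03-18", "2013-12-16", "2014-02-22", "2013-12-16"])

encoder = LabelEncoder()
codes = encoder.fit_transform(dates)

# classes_ holds the sorted unique values; each code is an index into that array.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the codes back to the original strings.
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Keeping the fitted encoders, as `number_encode_features` does, is what makes this reverse mapping possible later.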
We will use a feature-correlation visualization, which shows how the features depend on one another.
features_target = encoded_full
interval_cols = encoded_full.columns.tolist()  # phik expects a list of column names; every column is treated as interval here
phik_overview = features_target.phik_matrix(interval_cols=interval_cols)
plot_correlation_matrix(phik_overview.values,
                        x_labels=phik_overview.columns,
                        y_labels=phik_overview.index,
                        vmin=0, vmax=1, color_map="Greens",
                        title="Feature correlation",
                        fontsize_factor=1.5,
                        figsize=(20, 10))
plt.tight_layout()
This visualization shows that the time columns are consistently dependent on one another, while the site columns are only weakly dependent on one another.
Next we use a pair plot, which can show how the variables relate to one another and how widely they are spread.
sns_plot = sns.pairplot(full_df)
sns_plot.savefig('pairplot.png')
This visualization shows that the data are very widely scattered.
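A pair plot draws one panel per pair of numeric columns, so on a table with hundreds of thousands of rows it can be slow and heavily overplotted. One possible way to keep it readable, sketched below under assumptions, is to plot a random subsample of a few chosen columns. The frame `demo`, the column choice, and the sample size of 500 are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sessions table (random values, not real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 1000, size=(10_000, 4)),
    columns=["site1", "site2", "site3", "site4"],
)

# Keep only the columns of interest and a fixed-size random sample of rows.
cols = ["site1", "site2", "site3"]
subset = demo[cols].sample(n=500, random_state=0)

print(subset.shape)
# The smaller frame can then be passed to the plotting call, e.g. sns.pairplot(subset).
```

Fixing `random_state` makes the sample, and therefore the figure, reproducible between runs.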
Next is a displot visualization, which can show the distribution of the data in a different form.
sns.set_theme(style="darkgrid")
df = full_df
sns.displot(
    df,
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)
This visualization reveals considerably more than the previous one.
2.2 Feature Engineering
full_df
| | session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | ... | site6 | time6 | site7 | time7 | site8 | time8 | site9 | time9 | site10 | time10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 2014-02-22 | 941 | 2014-02-22 | 3847 | 2014-02-22 | 941 | 2014-02-22 | 942 | ... | 3846 | 2014-02-22 | 3847 | 2014-02-22 | 3846 | 2014-02-22 | 1516 | 2014-02-22 | 1518 | 2014-02-22 |
| 1 | 3 | 14769 | 2013-12-16 | 39 | 2013-12-16 | 14768 | 2013-12-16 | 14769 | 2013-12-16 | 37 | ... | 39 | 2013-12-16 | 14768 | 2013-12-16 | 14768 | 2013-12-16 | 14768 | 2013-12-16 | 14768 | 2013-12-16 |
| 2 | 4 | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | ... | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | 2014-03-28 | 782 | 2014-03-28 |
| 3 | 5 | 22 | 2014-02-28 | 177 | 2014-02-28 | 175 | 2014-02-28 | 178 | 2014-02-28 | 177 | ... | 178 | 2014-02-28 | 175 | 2014-02-28 | 177 | 2014-02-28 | 177 | 2014-02-28 | 178 | 2014-02-28 |
| 4 | 6 | 570 | 2014-03-18 | 21 | 2014-03-18 | 570 | 2014-03-18 | 21 | 2014-03-18 | 21 | ... | 178 | 2014-02-28 | 175 | 2014-02-28 | 177 | 2014-02-28 | 177 | 2014-02-28 | 178 | 2014-02-28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 2014-10-02 | 1039 | 2014-10-02 | 676 | 2014-10-02 | 3 | 2014-05-27 | 167 | ... | 45064 | 2014-05-27 | 45065 | 2014-05-27 | 384 | 2014-05-27 | 23 | 2014-05-27 | 3346 | 2014-05-27 |
| 82793 | 82794 | 300 | 2014-05-26 | 302 | 2014-05-26 | 302 | 2014-05-26 | 300 | 2014-05-26 | 300 | ... | 1222 | 2014-05-26 | 302 | 2014-05-26 | 1218 | 2014-05-26 | 1221 | 2014-05-26 | 1216 | 2014-05-26 |
| 82794 | 82795 | 29 | 2014-05-02 | 33 | 2014-05-02 | 35 | 2014-05-02 | 22 | 2014-05-02 | 37 | ... | 6779 | 2014-05-02 | 30 | 2014-05-02 | 21 | 2014-05-02 | 23 | 2014-05-02 | 6780 | 2014-05-02 |
| 82795 | 82796 | 5828 | 2014-05-03 | 23 | 2014-05-03 | 21 | 2014-05-03 | 804 | 2014-05-03 | 21 | ... | 3350 | 2014-05-03 | 23 | 2014-05-03 | 894 | 2014-05-03 | 21 | 2014-05-03 | 961 | 2014-05-03 |
| 82796 | 82797 | 21 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | ... | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 | 1098 | 2014-11-02 |
336357 rows × 21 columns
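Since the `time` columns in the table above hold date strings such as `2014-02-22`, one possible feature-engineering step is to parse them into datetimes and derive calendar features from them. The sketch below is an illustration on a hypothetical two-row frame. The derived column names `month` and `weekday` are assumptions, not features defined in this notebook.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the time columns (not the real data).
demo = pd.DataFrame({
    "time1": ["2014-02-22", "2013-12-16"],
    "time2": ["2014-02-22", "2013-12-16"],
})

# Parse every column whose name starts with "time" using the explicit ISO format.
time_cols = [c for c in demo.columns if c.startswith("time")]
for col in time_cols:
    demo[col] = pd.to_datetime(demo[col], format="%Y-%m-%d")

# Derive simple calendar features from the first timestamp of each session.
demo["month"] = demo["time1"].dt.month
demo["weekday"] = demo["time1"].dt.dayofweek  # Monday=0 ... Sunday=6

print(demo[["time1", "month", "weekday"]])
```

Parsed datetimes keep the year, month, and day available as separate numeric attributes, which plain strings do not.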
Replace the '-' separators in the 'time' columns with spaces.
for i in range(1, 11):  # time1 ... time10
    full_df[f'time{i}'] = full_df[f'time{i}'].str.replace('-', ' ')
full_df
|  | session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | ... | site6 | time6 | site7 | time7 | site8 | time8 | site9 | time9 | site10 | time10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 2014 02 22 | 941 | 2014 02 22 | 3847 | 2014 02 22 | 941 | 2014 02 22 | 942 | ... | 3846 | 2014 02 22 | 3847 | 2014 02 22 | 3846 | 2014 02 22 | 1516 | 2014 02 22 | 1518 | 2014 02 22 |
| 1 | 3 | 14769 | 2013 12 16 | 39 | 2013 12 16 | 14768 | 2013 12 16 | 14769 | 2013 12 16 | 37 | ... | 39 | 2013 12 16 | 14768 | 2013 12 16 | 14768 | 2013 12 16 | 14768 | 2013 12 16 | 14768 | 2013 12 16 |
| 2 | 4 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | ... | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 |
| 3 | 5 | 22 | 2014 02 28 | 177 | 2014 02 28 | 175 | 2014 02 28 | 178 | 2014 02 28 | 177 | ... | 178 | 2014 02 28 | 175 | 2014 02 28 | 177 | 2014 02 28 | 177 | 2014 02 28 | 178 | 2014 02 28 |
| 4 | 6 | 570 | 2014 03 18 | 21 | 2014 03 18 | 570 | 2014 03 18 | 21 | 2014 03 18 | 21 | ... | 178 | 2014 02 28 | 175 | 2014 02 28 | 177 | 2014 02 28 | 177 | 2014 02 28 | 178 | 2014 02 28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 2014 10 02 | 1039 | 2014 10 02 | 676 | 2014 10 02 | 3 | 2014 05 27 | 167 | ... | 45064 | 2014 05 27 | 45065 | 2014 05 27 | 384 | 2014 05 27 | 23 | 2014 05 27 | 3346 | 2014 05 27 |
| 82793 | 82794 | 300 | 2014 05 26 | 302 | 2014 05 26 | 302 | 2014 05 26 | 300 | 2014 05 26 | 300 | ... | 1222 | 2014 05 26 | 302 | 2014 05 26 | 1218 | 2014 05 26 | 1221 | 2014 05 26 | 1216 | 2014 05 26 |
| 82794 | 82795 | 29 | 2014 05 02 | 33 | 2014 05 02 | 35 | 2014 05 02 | 22 | 2014 05 02 | 37 | ... | 6779 | 2014 05 02 | 30 | 2014 05 02 | 21 | 2014 05 02 | 23 | 2014 05 02 | 6780 | 2014 05 02 |
| 82795 | 82796 | 5828 | 2014 05 03 | 23 | 2014 05 03 | 21 | 2014 05 03 | 804 | 2014 05 03 | 21 | ... | 3350 | 2014 05 03 | 23 | 2014 05 03 | 894 | 2014 05 03 | 21 | 2014 05 03 | 961 | 2014 05 03 |
| 82796 | 82797 | 21 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | ... | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 |
336357 rows × 21 columns
Split the data into several columns (year into 'timef', month into 'timea')
# For each time column, keep the year (first token) and the month (second token)
for i in range(1, 11):
    full_df[f'timef{i}'] = full_df[f'time{i}'].apply(lambda x: x.split()[0])
    full_df[f'timea{i}'] = full_df[f'time{i}'].apply(lambda x: x.split()[1])
full_df
|  | session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | ... | timef6 | timea6 | timef7 | timea7 | timef8 | timea8 | timef9 | timea9 | timef10 | timea10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 2014 02 22 | 941 | 2014 02 22 | 3847 | 2014 02 22 | 941 | 2014 02 22 | 942 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| 1 | 3 | 14769 | 2013 12 16 | 39 | 2013 12 16 | 14768 | 2013 12 16 | 14769 | 2013 12 16 | 37 | ... | 2013 | 12 | 2013 | 12 | 2013 | 12 | 2013 | 12 | 2013 | 12 |
| 2 | 4 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | 2014 03 28 | 782 | ... | 2014 | 03 | 2014 | 03 | 2014 | 03 | 2014 | 03 | 2014 | 03 |
| 3 | 5 | 22 | 2014 02 28 | 177 | 2014 02 28 | 175 | 2014 02 28 | 178 | 2014 02 28 | 177 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| 4 | 6 | 570 | 2014 03 18 | 21 | 2014 03 18 | 570 | 2014 03 18 | 21 | 2014 03 18 | 21 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 2014 10 02 | 1039 | 2014 10 02 | 676 | 2014 10 02 | 3 | 2014 05 27 | 167 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82793 | 82794 | 300 | 2014 05 26 | 302 | 2014 05 26 | 302 | 2014 05 26 | 300 | 2014 05 26 | 300 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82794 | 82795 | 29 | 2014 05 02 | 33 | 2014 05 02 | 35 | 2014 05 02 | 22 | 2014 05 02 | 37 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82795 | 82796 | 5828 | 2014 05 03 | 23 | 2014 05 03 | 21 | 2014 05 03 | 804 | 2014 05 03 | 21 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82796 | 82797 | 21 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | 2014 11 02 | 1098 | ... | 2014 | 11 | 2014 | 11 | 2014 | 11 | 2014 | 11 | 2014 | 11 |
336357 rows × 41 columns
Drop the old columns
full_df = full_df.drop(columns=[f'time{i}' for i in range(1, 11)])
full_df
|  | session_id | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | ... | timef6 | timea6 | timef7 | timea7 | timef8 | timea8 | timef9 | timea9 | timef10 | timea10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 941 | 3847 | 941 | 942 | 3846 | 3847 | 3846 | 1516 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| 1 | 3 | 14769 | 39 | 14768 | 14769 | 37 | 39 | 14768 | 14768 | 14768 | ... | 2013 | 12 | 2013 | 12 | 2013 | 12 | 2013 | 12 | 2013 | 12 |
| 2 | 4 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | ... | 2014 | 03 | 2014 | 03 | 2014 | 03 | 2014 | 03 | 2014 | 03 |
| 3 | 5 | 22 | 177 | 175 | 178 | 177 | 178 | 175 | 177 | 177 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| 4 | 6 | 570 | 21 | 570 | 21 | 21 | 178 | 175 | 177 | 177 | ... | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 | 2014 | 02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 1039 | 676 | 3 | 167 | 45064 | 45065 | 384 | 23 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82793 | 82794 | 300 | 302 | 302 | 300 | 300 | 1222 | 302 | 1218 | 1221 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82794 | 82795 | 29 | 33 | 35 | 22 | 37 | 6779 | 30 | 21 | 23 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82795 | 82796 | 5828 | 23 | 21 | 804 | 21 | 3350 | 23 | 894 | 21 | ... | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 | 2014 | 05 |
| 82796 | 82797 | 21 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | ... | 2014 | 11 | 2014 | 11 | 2014 | 11 | 2014 | 11 | 2014 | 11 |
336357 rows × 31 columns
Concatenate the year and month columns
# String concatenation: '2014' + '02' -> '201402'
for i in range(1, 11):
    full_df[f'time_s{i}'] = full_df[f'timef{i}'] + full_df[f'timea{i}']
full_df
|  | session_id | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | ... | time_s1 | time_s2 | time_s3 | time_s4 | time_s5 | time_s6 | time_s7 | time_s8 | time_s9 | time_s10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 941 | 3847 | 941 | 942 | 3846 | 3847 | 3846 | 1516 | ... | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 |
| 1 | 3 | 14769 | 39 | 14768 | 14769 | 37 | 39 | 14768 | 14768 | 14768 | ... | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 |
| 2 | 4 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | ... | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 |
| 3 | 5 | 22 | 177 | 175 | 178 | 177 | 178 | 175 | 177 | 177 | ... | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 |
| 4 | 6 | 570 | 21 | 570 | 21 | 21 | 178 | 175 | 177 | 177 | ... | 201403 | 201403 | 201403 | 201403 | 201403 | 201402 | 201402 | 201402 | 201402 | 201402 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 1039 | 676 | 3 | 167 | 45064 | 45065 | 384 | 23 | ... | 201410 | 201410 | 201410 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82793 | 82794 | 300 | 302 | 302 | 300 | 300 | 1222 | 302 | 1218 | 1221 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82794 | 82795 | 29 | 33 | 35 | 22 | 37 | 6779 | 30 | 21 | 23 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82795 | 82796 | 5828 | 23 | 21 | 804 | 21 | 3350 | 23 | 894 | 21 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82796 | 82797 | 21 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | ... | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 |
336357 rows × 41 columns
Drop the old columns
full_df = full_df.drop(columns=[f'{prefix}{i}' for prefix in ('timef', 'timea') for i in range(1, 11)])
full_df
|  | session_id | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | ... | time_s1 | time_s2 | time_s3 | time_s4 | time_s5 | time_s6 | time_s7 | time_s8 | time_s9 | time_s10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 890 | 941 | 3847 | 941 | 942 | 3846 | 3847 | 3846 | 1516 | ... | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 |
| 1 | 3 | 14769 | 39 | 14768 | 14769 | 37 | 39 | 14768 | 14768 | 14768 | ... | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 | 201312 |
| 2 | 4 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | 782 | ... | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 | 201403 |
| 3 | 5 | 22 | 177 | 175 | 178 | 177 | 178 | 175 | 177 | 177 | ... | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 | 201402 |
| 4 | 6 | 570 | 21 | 570 | 21 | 21 | 178 | 175 | 177 | 177 | ... | 201403 | 201403 | 201403 | 201403 | 201403 | 201402 | 201402 | 201402 | 201402 | 201402 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 82792 | 82793 | 812 | 1039 | 676 | 3 | 167 | 45064 | 45065 | 384 | 23 | ... | 201410 | 201410 | 201410 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82793 | 82794 | 300 | 302 | 302 | 300 | 300 | 1222 | 302 | 1218 | 1221 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82794 | 82795 | 29 | 33 | 35 | 22 | 37 | 6779 | 30 | 21 | 23 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82795 | 82796 | 5828 | 23 | 21 | 804 | 21 | 3350 | 23 | 894 | 21 | ... | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 | 201405 |
| 82796 | 82797 | 21 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | 1098 | ... | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 | 201411 |
336357 rows × 21 columns
And look at the result
full_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 336357 entries, 0 to 82796
Data columns (total 21 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   session_id  336357 non-null  int64
 1   site1       336357 non-null  int64
 2   site2       336357 non-null  int64
 3   site3       336357 non-null  int64
 4   site4       336357 non-null  int64
 5   site5       336357 non-null  int64
 6   site6       336357 non-null  int64
 7   site7       336357 non-null  int64
 8   site8       336357 non-null  int64
 9   site9       336357 non-null  int64
 10  site10      336357 non-null  int64
 11  time_s1     336357 non-null  object
 12  time_s2     336357 non-null  object
 13  time_s3     336357 non-null  object
 14  time_s4     336357 non-null  object
 15  time_s5     336357 non-null  object
 16  time_s6     336357 non-null  object
 17  time_s7     336357 non-null  object
 18  time_s8     336357 non-null  object
 19  time_s9     336357 non-null  object
 20  time_s10    336357 non-null  object
dtypes: int64(11), object(10)
memory usage: 56.5+ MB
This feature was built to capture the monthly linear trend over the entire period covered by the provided data.
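The `info()` output above reports the `time_s*` columns as `object`, meaning they still hold strings such as '201402'. If the feature is meant to act as a number, one possible follow-up is an integer cast. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the notebook's `full_df`) and assumes every value is a clean six-digit string.

```python
import pandas as pd

# Hypothetical stand-in for full_df with two of the ten time_s columns.
toy_df = pd.DataFrame({
    "session_id": [2, 3],
    "time_s1": ["201402", "201312"],
    "time_s2": ["201402", "201312"],
})

# Collect the string-typed YYYYMM columns and cast them to 64-bit integers.
time_cols = [c for c in toy_df.columns if c.startswith("time_s")]
toy_df = toy_df.astype({c: "int64" for c in time_cols})

print(toy_df.dtypes)
```

After the cast the values can be compared and ordered numerically, which a model needs in order to treat them as a trend.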
2.3 Report preparation
2.1 Visual data analysis
- In this task, the dependencies between attributes in the dataset were visualized. The visualizations showed how the attributes affect the target variable, and the data were interpreted. The first visualization showed the dependencies in the data, and the second showed the spread of the data.
2.2 Feature engineering
- In this task, a feature, time*, was created: a number of the form YYYYMM taken from the date on which the session took place. This lets us account for the monthly linear trend over the entire period covered by the provided data. New features were added that, in my view, should improve the quality of the chosen model. A function for creating new features was written. The techniques used to generate new data and the results were described. As a result, more than 10 new time features were obtained.
2.3 Report preparation
- In this task, a report was produced that shows all the work done and its results.
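The feature-creation function mentioned in 2.2 does not appear in this part of the notebook. As a rough sketch of what such a helper might look like, the example below derives YYYYMM directly from the date via `pd.to_datetime` instead of string splitting. The name `add_yyyymm`, its `n_sites` parameter, and the toy data are assumptions for illustration, and the dates are assumed to be 'YYYY-MM-DD' strings as in the tables above.

```python
import pandas as pd


def add_yyyymm(df: pd.DataFrame, n_sites: int = 10) -> pd.DataFrame:
    """Return a copy of df with each time{i} column replaced by an integer time_s{i} = YYYYMM."""
    out = df.copy()
    for i in range(1, n_sites + 1):
        # Parse the date strings, then combine year and month arithmetically.
        dates = pd.to_datetime(out[f"time{i}"], format="%Y-%m-%d")
        out[f"time_s{i}"] = dates.dt.year * 100 + dates.dt.month
        out = out.drop(columns=f"time{i}")
    return out


# Hypothetical two-row frame with a single site/time pair.
toy_df = pd.DataFrame({
    "session_id": [2, 3],
    "site1": [890, 14769],
    "time1": ["2014-02-22", "2013-12-16"],
})
result = add_yyyymm(toy_df, n_sites=1)
print(result)
```

Computing `year * 100 + month` yields integers straight away, so no intermediate `timef`/`timea` columns or later cast are needed.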